12  R: Data structures

R has several data structures to store and manipulate data efficiently. These structures can be classified into homogeneous (same type) and heterogeneous (different types) categories.

12.1 Atomic vectors (1D, Homogeneous)

Atomic vectors are the most fundamental data type in R. They are one-dimensional collections of elements, where all elements must share the same data type.
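
As a small illustration of the same-type rule (a sketch; the variable names mixed and num_lgl are only for this example), combining values of different types with c() silently coerces them to a common type:

```r
# Mixing a number, a string, and a logical coerces everything to character
mixed <- c(1, "a", TRUE)
class(mixed)

# Mixing numbers and logicals coerces the logicals to numeric (TRUE becomes 1)
num_lgl <- c(1.5, TRUE, FALSE)
class(num_lgl)
```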

Since atomic vectors were covered in the previous chapter, this chapter will focus on the remaining three data structures: matrices, lists, and data frames.

12.2 Matrix (2D, Homogeneous)

Matrices are two-dimensional arrays.

12.2.1 Creating a matrix

The built-in function matrix() is used to define a matrix. An atomic vector can be organized as a matrix by specifying the number of rows and columns.

For example, let us define a 3x3 matrix (3 rows and 3 columns) consisting of consecutive integers from 1 to 9.

mat <- matrix(1:9, 3, 3)
mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Note that the integers fill the matrix column-wise by default. To fill the matrix by row instead, set the byrow argument to TRUE.

mat <- matrix(1:9, 3, 3, byrow = TRUE)
mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

12.2.2 Getting matrix dimensions, the number of rows, the number of columns

The dim() function returns the dimensions of a matrix as a vector of the form c(rows, cols):

dim(mat)
[1] 3 3

The length() function returns the total number of elements:

length(mat)
[1] 9

The functions nrow() and ncol() can be used to get the number of rows and columns of the matrix respectively.

nrow(mat)
[1] 3
ncol(mat)
[1] 3
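
Because mat is square, the example above cannot show which value is the row count and which is the column count. The following sketch uses a hypothetical 2x3 matrix named rect to make the order visible:

```r
# A 2x3 matrix: 2 rows, 3 columns
rect <- matrix(1:6, nrow = 2, ncol = 3)

dim(rect)     # rows first, then columns
nrow(rect)    # number of rows
ncol(rect)    # number of columns
length(rect)  # total number of elements, i.e. nrow * ncol
```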

12.2.3 Matrix indexing and Slicing

Matrices can be sliced using the row and column indices, separated by a comma, within square brackets. Suppose we wish to get the element in the \(2^{nd}\) row and \(3^{rd}\) column of the matrix:

mat[2, 3]
[1] 6

To select all rows or all columns of a matrix, the index for the row/column can be left blank. Suppose we wish to get all the elements of the \(1^{st}\) row of the matrix:

mat[1, ]
[1] 1 2 3

Rows and columns of the matrix can be sliced using the : operator. Suppose we want to select a sub-matrix that has elements in the first two rows and columns 2 and 3 of the matrix mat:

mat[1:2, 2:3]
     [,1] [,2]
[1,]    2    3
[2,]    5    6
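
A related detail, sketched below with the same mat as above: selecting a single row or column returns a plain atomic vector by default. The drop = FALSE argument of the [ operator keeps the result as a matrix, and a vector of indices selects rows or columns that are not adjacent. The names col2, col2_mat and corners are only for this illustration.

```r
mat <- matrix(1:9, 3, 3, byrow = TRUE)

# A single column is returned as a plain vector
col2 <- mat[, 2]
is.matrix(col2)

# drop = FALSE keeps the result as a 3x1 matrix
col2_mat <- mat[, 2, drop = FALSE]
dim(col2_mat)

# Non-adjacent rows and columns can be selected with c()
corners <- mat[c(1, 3), c(1, 3)]
corners
```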

12.2.4 Adding and Removing

  • Use cbind() to add a column whose length matches the number of rows.
  • Use rbind() to add a row whose length matches the number of columns.
  • If the length does not match, R “recycles” the supplied values to fill the new row or column. When the lengths are not compatible, R issues a warning rather than an error, so the code still runs.
  • cbind() and rbind() cannot insert a new row or column between the existing rows/columns of a matrix. However, you can re-arrange the rows/columns after adding it, using indexing.
  • Use negative indices to remove rows/columns.
new_col <- c(7, 8, 9)

# add the column 
mat_new <- cbind(mat, new_col)
print(mat_new)
           new_col
[1,] 1 2 3       7
[2,] 4 5 6       8
[3,] 7 8 9       9
mat_new <- cbind(mat, new_col = 5)
print(mat_new)
           new_col
[1,] 1 2 3       5
[2,] 4 5 6       5
[3,] 7 8 9       5
new_row <- c(7, 8, 9)

# add the row 
mat_new <- rbind(mat, new_row)
print(mat_new)
        [,1] [,2] [,3]
           1    2    3
           4    5    6
           7    8    9
new_row    7    8    9
mat_new <- rbind(mat, new_row = 5)
print(mat_new)
        [,1] [,2] [,3]
           1    2    3
           4    5    6
           7    8    9
new_row    5    5    5
# remove the added row
mat_new <- mat_new[-4, ]
print(mat_new)
 [,1] [,2] [,3]
    1    2    3
    4    5    6
    7    8    9
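# remove the 4th column; mat_new has only 3 columns at this point, so the
# out-of-range index -4 is ignored (on the cbind() result above it would drop new_col)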
mat_new <- mat_new[, -4]
print(mat_new)
 [,1] [,2] [,3]
    1    2    3
    4    5    6
    7    8    9
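
The points about re-arranging and recycling can be sketched as follows. This is an illustrative example only; the values 10, 11, 12 and the names mat_reordered and short_col are assumptions made for the sketch.

```r
mat <- matrix(1:9, 3, 3, byrow = TRUE)

# Append a column, then move it to the second position using indexing
mat_new <- cbind(mat, c(10, 11, 12))
mat_reordered <- mat_new[, c(1, 4, 2, 3)]
mat_reordered

# A vector of length 2 does not fit 3 rows: R warns and recycles the values.
# suppressWarnings() is used here only to keep the output clean.
short_col <- suppressWarnings(cbind(mat, c(1, 2)))
short_col[, 4]
```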

12.2.5 Iterating

To iterate over every element, you generally need a nested loop over the row and column indices, obtained with nrow() and ncol():

for (i in 1:nrow(mat)) {
  for (j in 1:ncol(mat)) {
    print(paste("Element at (", i, ",", j, "):", mat[i, j]))
  }
}
[1] "Element at ( 1 , 1 ): 1"
[1] "Element at ( 1 , 2 ): 2"
[1] "Element at ( 1 , 3 ): 3"
[1] "Element at ( 2 , 1 ): 4"
[1] "Element at ( 2 , 2 ): 5"
[1] "Element at ( 2 , 3 ): 6"
[1] "Element at ( 3 , 1 ): 7"
[1] "Element at ( 3 , 2 ): 8"
[1] "Element at ( 3 , 3 ): 9"
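
One caveat with the 1:nrow(mat) pattern, shown here as a small sketch with a hypothetical empty matrix: when a matrix has zero rows, 1:nrow() counts down from 1 to 0 instead of producing an empty sequence. The base function seq_len() avoids this.

```r
# A matrix with 0 rows and 3 columns
empty <- matrix(numeric(0), nrow = 0, ncol = 3)

bad_index <- 1:nrow(empty)          # counts down: 1, 0
safe_index <- seq_len(nrow(empty))  # empty sequence, so a loop body never runs

bad_index
length(safe_index)
```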

12.2.6 Element-wise arithmetic operations

Element-wise arithmetic operations follow the same logic as for atomic vectors. They can be performed between two matrices of the same shape.

mat1 <- matrix(1:6, 2, 3)
mat2 <- matrix(c(9, 2, 6, 5, 1, 0), 2, 3)
mat1 + mat2
     [,1] [,2] [,3]
[1,]   10    9    6
[2,]    4    9    6
mat1 - mat2
     [,1] [,2] [,3]
[1,]   -8   -3    4
[2,]    0   -1    6
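
The sketch below illustrates two further cases under the same rule: an operation between a matrix and a single number is applied to every element, while an operation between two matrices of different shapes fails. The names doubled, other and msg are only for this example.

```r
mat1 <- matrix(1:6, 2, 3)

# A scalar is applied to every element
doubled <- mat1 * 2
doubled

# Matrices of different shapes (2x3 and 2x2) cannot be added;
# tryCatch() captures the error message instead of stopping the script
other <- matrix(1:4, 2, 2)
msg <- tryCatch(
  {
    mat1 + other
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```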

Suppose we need to sum up all the rows of the matrix. We can do it using a for loop as follows:

# Initialize one running total per row of the matrix
row_sum <- numeric(nrow(mat))
for (i in 1:nrow(mat)) {
  for (j in 1:ncol(mat)) {
    row_sum[i] <- row_sum[i] + mat[i, j]
  }
}
row_sum
[1]  6 15 24

Observe that in the above for loop, the elements of each row are added one at a time. We can add all the elements of a row at once using the sum() function, which removes the inner for loop:

row_sum <- numeric(nrow(mat))
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15 24

In the above code, all the elements of a row are summed at once. However, the rows themselves are still processed one at a time by the for loop.

12.2.6.1 Matrix multiplication vs. element-wise multiplication

  • Matrix multiplication: using %*%
# Create another 3x3 matrix
mat2 <- matrix(9:1, nrow = 3)

# Perform matrix multiplication
result <- mat %*% mat2
print(result)
     [,1] [,2] [,3]
[1,]   46   28   10
[2,]  118   73   28
[3,]  190  118   46
  • Element-wise multiplication: using *
# Element-wise multiplication
mat * mat
     [,1] [,2] [,3]
[1,]    1    4    9
[2,]   16   25   36
[3,]   49   64   81
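
The two operators also differ in the shapes they accept. As a sketch with two hypothetical matrices A (2x3) and B (3x2): %*% requires the number of columns of the first matrix to equal the number of rows of the second, and the result has the rows of the first and the columns of the second.

```r
A <- matrix(1:6, 2, 3)  # 2 rows, 3 columns
B <- matrix(1:6, 3, 2)  # 3 rows, 2 columns

# (2x3) %*% (3x2) gives a 2x2 matrix
AB <- A %*% B
dim(AB)
AB

# (3x2) %*% (2x3) gives a 3x3 matrix
BA <- B %*% A
dim(BA)
```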

12.2.7 The apply() function

The apply() function applies a function to each row or column of a matrix, which saves the user from writing an explicit for() loop over the rows or columns. Note that each row / column of a matrix is an atomic vector. Thus, vectorized computations can be performed within the applied function, resulting in efficient computations.

Note that apply() still loops over the rows / columns internally, and thus the function is applied sequentially to each row / column of the matrix. Its main benefit is concise code; whether it is faster than an explicit for() loop depends on the computation and on the size of the matrix.

Let us use the apply() function to sum up all the rows of the matrix mat.

apply(mat, 1, sum)
[1]  6 15 24
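
The second argument of apply() selects the direction: 1 for rows and 2 for columns. The sketch below applies sum() over columns and, for comparison, uses the base R functions rowSums() and colSums(), which cover the special case of sums directly. The variable names are only for this example.

```r
mat <- matrix(1:9, 3, 3, byrow = TRUE)

# MARGIN = 2 applies the function to each column
col_totals <- apply(mat, 2, sum)
col_totals

# Dedicated base R functions for row and column sums
row_totals <- rowSums(mat)
col_totals2 <- colSums(mat)
row_totals
col_totals2
```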

Let us compare the time taken to sum up rows of a matrix using a for loop with the time taken using the apply() function.

options(digits.secs = 6)
start.time <- Sys.time()
row_sum <- numeric(nrow(mat))
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15 24
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.005076408 secs
start.time <- Sys.time()
apply(mat, 1, sum)
[1]  6 15 24
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.001475334 secs

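In this run, the apply() call took less time than the for loop to sum up the rows of the matrix. Note that a single timing of such a small computation is dominated by noise, so the difference should be interpreted with caution.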
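
A somewhat more robust comparison is sketched below, using the base function system.time() on a larger matrix. The matrix big and its size are assumptions chosen for illustration, and the measured times will vary from machine to machine, so no particular outcome is claimed here.

```r
set.seed(1)
# A hypothetical 1000 x 1000 matrix of random numbers
big <- matrix(runif(1e6), nrow = 1000)

# Time the explicit for loop
loop_time <- system.time({
  loop_sums <- numeric(nrow(big))
  for (i in seq_len(nrow(big))) {
    loop_sums[i] <- sum(big[i, ])
  }
})

# Time the apply() call
apply_time <- system.time(apply_sums <- apply(big, 1, sum))

loop_time["elapsed"]
apply_time["elapsed"]

# Both approaches should give the same row sums
all.equal(loop_sums, apply_sums)
```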

12.2.8 Miscellaneous useful functions

Check if a value/string is in the matrix:

  • "val" %in% my_mat
# Create a matrix
fruit_mat <- matrix(c("apple", "banana", "cherry", "date"), nrow=2, byrow=TRUE)

# Check if "banana" is in the matrix
"banana" %in% fruit_mat 
[1] TRUE
# Check if "grape" is in the matrix
"grape" %in% fruit_mat  
[1] FALSE

The same check can be written by combining any() with the == operator:

any(fruit_mat == "banana")  
[1] TRUE
any(fruit_mat == "grape") 
[1] FALSE
# Find the row and column position of "banana" in the matrix
which(fruit_mat == "banana", arr.ind = TRUE)
     row col
[1,]   1   2

Check if a value is missing (i.e., NA):

  • is.na(val)
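
Applied to a whole matrix, is.na() returns a logical matrix of the same shape. A minimal sketch with a hypothetical matrix m containing one missing value:

```r
# Filled column-wise, so the NA sits in row 2, column 1
m <- matrix(c(1, NA, 3, 4), 2, 2)

na_mask <- is.na(m)   # logical matrix, TRUE where a value is missing
na_mask

n_missing <- sum(na_mask)                  # number of missing values
na_pos <- which(na_mask, arr.ind = TRUE)   # row and column of each NA
n_missing
na_pos
```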

12.3 Lists (1D, Heterogeneous)

Atomic vectors and matrices are quite useful in R. However, a constraint with them is that they can only contain objects of the same datatype. For example, an atomic vector can contain all numeric objects, all character objects, or all logical objects, but not a mixture of multiple types of objects. Thus, there arises a need for a list data structure that can store objects of multiple datatypes.

12.3.1 Creating and Indexing

A list can be defined using the list() function. For example, consider the list below:

list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

The list list_ex consists of objects of multiple datatypes. The length of the list can be obtained using the length() function:

length(list_ex)
[1] 4

A list is an ordered collection of objects. Each object of the list is associated with an index that corresponds to its order of occurrence in the list.

A single element can be sliced from the list by specifying its index within the [[]] operator. Let us slice the \(2^{nd}\) element of the list list_ex:

list_ex[[2]]
[1] "apple"

Multiple elements can be sliced from the list by specifying the indices as an atomic vector within the [] operator. Let us slice the \(1^{st}\) and \(3^{rd}\) elements from the list list_ex:

list_ex[c(1,3)]
[[1]]
[1] 1

[[2]]
[1] TRUE
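
The difference between the two operators can be sketched as follows: [] always returns a list, even for a single index, whereas [[]] returns the element itself. Chaining [[]] reaches into a nested list. The names single, element and nested are only for this example.

```r
list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

single <- list_ex[2]      # a list of length 1 that contains "apple"
element <- list_ex[[2]]   # the character value "apple" itself
class(single)
class(element)

# First element of the nested list stored at position 4
nested <- list_ex[[4]][[1]]
nested
```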

12.3.2 Naming a list

Elements of a list can be named using the names() function. Let us name the elements of list_ex:

names(list_ex) <- c("Name1", "second_name", "3rd_element", "Number 4")

A single element can be sliced from the list using the name of the element with the $ operator. Let us slice the element named as second_name from the list list_ex:

list_ex$second_name
[1] "apple"

Note that if the name of the element does not begin with a letter or has special characters such as a space, then it should be enclosed in backticks (or quotes) after the $ operator. For example, let us slice the element named as 3rd_element from the list list_ex:

list_ex$`3rd_element`
[1] TRUE

Names of elements of a list can also be specified while defining the list, as in the example below:

list_ex_with_names <- list(movie = 'The Dark Knight', IMDB_rating = 9)

12.3.3 Using as Function Input

Arithmetic and logical operations cannot be applied directly to a list, since its elements may have different data types. A list can be converted to an atomic vector using the unlist() function. For example, let us convert the list list_ex to a vector:

unlist(list_ex)
         Name1    second_name    3rd_element      Number 41      Number 42 
           "1"        "apple"         "TRUE" "another list"         "TRUE" 

Since a vector can contain objects of a single datatype, note that all objects have been converted to the character datatype in the vector above.

Another example:

# Define a list with mixed data types
my_list <- list(a = 1, b = 2, c = "text", d = 4, e = TRUE)

# Function to sum numeric elements in the list
sum_list_numeric <- function(lst) {
  # Convert the list to a vector; with mixed types, all elements are coerced to character
  vec <- unlist(lst)
  
  # Keep only the values that convert to numeric (as.numeric() gives NA, with a warning, for the rest)
  numeric_values <- as.numeric(vec[!is.na(as.numeric(vec))])
  
  # Compute sum
  return(sum(numeric_values))
}

# Call the function
result <- sum_list_numeric(my_list)
Warning in sum_list_numeric(my_list): NAs introduced by coercion
print(result) 
[1] 7
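
An alternative sketch that avoids the coercion warning keeps only the elements that are already numeric, using the base function Filter(). Note that this is a different technique from the function above: a number stored as text, such as "5", would be skipped here rather than converted.

```r
my_list <- list(a = 1, b = 2, c = "text", d = 4, e = TRUE)

# Keep only the elements for which is.numeric() is TRUE (a, b and d)
numeric_only <- Filter(is.numeric, my_list)
names(numeric_only)

# Sum the remaining values
total <- sum(unlist(numeric_only))
total
```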

12.3.4 Applying a function to each element of a list: the lapply() function

# Define a list with numeric values
num_list <- list(a = 1:5, b = 6:10, c = 11:15)

# Use lapply() to compute the sum of each list element
sum_results <- lapply(num_list, sum)

# Print the result
print(sum_results)
$a
[1] 15

$b
[1] 40

$c
[1] 65
  • Using lapply() with an Anonymous Function
# Compute the mean of each element in the list
mean_results <- lapply(num_list, function(x) mean(x))

# Print the result
print(mean_results)
$a
[1] 3

$b
[1] 8

$c
[1] 13
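
lapply() always returns a list. When each result is a single value, the list can be flattened with unlist(), which was introduced earlier; base R's sapply() does the same in one step. A short sketch, with variable names chosen only for this example:

```r
num_list <- list(a = 1:5, b = 6:10, c = 11:15)

# Flatten the list returned by lapply() into a named vector
sums_vec <- unlist(lapply(num_list, sum))
sums_vec

# sapply() simplifies the result to a vector when possible
sums_vec2 <- sapply(num_list, sum)
sums_vec2
```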

12.4 Data Frames (2D, Heterogeneous)

A data frame is a table-like structure where each column is a vector. Unlike matrices, columns can have different types.

12.4.1 Creating Data Frames

df <- data.frame(Name=c("Alice", "Bob", "Charlie"),
                 Age=c(25, 30, 35),
                 Score=c(90, 85, 88))

print(df)
     Name Age Score
1   Alice  25    90
2     Bob  30    85
3 Charlie  35    88
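
To see that the columns really have different types, the class of each column can be inspected. This sketch recreates df and sets stringsAsFactors = FALSE explicitly, so that the Name column is a character vector regardless of the R version.

```r
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, 30, 35),
                 Score = c(90, 85, 88),
                 stringsAsFactors = FALSE)

# Class of each column
col_classes <- sapply(df, class)
col_classes

# Compact overview of the structure
str(df)
```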

12.4.2 Accessing Data Frame Elements

df$Name   # Accessing a column
[1] "Alice"   "Bob"     "Charlie"
df[2, ]   # Second row
  Name Age Score
2  Bob  30    85
df[ , "Age"]  # Age column
[1] 25 30 35
df[1:2, c("Name", "Score")]  # Subset
   Name Score
1 Alice    90
2   Bob    85
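
Rows can also be selected with a logical condition in the row position, in the same way as for vectors. A minimal sketch; the threshold of 26 and the names older, older_names and top_scorer are only for illustration.

```r
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, 30, 35),
                 Score = c(90, 85, 88),
                 stringsAsFactors = FALSE)

# All columns for the rows where Age is greater than 26
older <- df[df$Age > 26, ]
older

# Only the Name column for those rows
older_names <- df[df$Age > 26, "Name"]
older_names

# Name of the person with the highest score
top_scorer <- df[df$Score == max(df$Score), "Name"]
top_scorer
```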

12.4.3 Iterating

# Iterate over rows
for (i in 1:nrow(df)) {
  print(paste("Row", i, ":", df[i, ]))
}
[1] "Row 1 : Alice" "Row 1 : 25"    "Row 1 : 90"   
[1] "Row 2 : Bob" "Row 2 : 30"  "Row 2 : 85" 
[1] "Row 3 : Charlie" "Row 3 : 35"      "Row 3 : 88"     
for (col in names(df)) {
  print(paste("Column:", col))
  print(df[[col]])  # Access column by name
}
[1] "Column: Name"
[1] "Alice"   "Bob"     "Charlie"
[1] "Column: Age"
[1] 25 30 35
[1] "Column: Score"
[1] 90 85 88
for (i in 1:nrow(df)) {
  for (j in 1:ncol(df)) {
    print(paste("Element at [", i, ",", j, "]:", df[i, j]))
  }
}
[1] "Element at [ 1 , 1 ]: Alice"
[1] "Element at [ 1 , 2 ]: 25"
[1] "Element at [ 1 , 3 ]: 90"
[1] "Element at [ 2 , 1 ]: Bob"
[1] "Element at [ 2 , 2 ]: 30"
[1] "Element at [ 2 , 3 ]: 85"
[1] "Element at [ 3 , 1 ]: Charlie"
[1] "Element at [ 3 , 2 ]: 35"
[1] "Element at [ 3 , 3 ]: 88"
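
For column-wise summaries, an explicit loop is often unnecessary. As a sketch, the numeric columns can be summarized with the base functions colMeans() and sapply(); the names num_cols, col_means and col_max are only for this example.

```r
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, 30, 35),
                 Score = c(90, 85, 88),
                 stringsAsFactors = FALSE)

# Keep only the numeric columns
num_cols <- df[, c("Age", "Score")]

# Mean of each column
col_means <- colMeans(num_cols)
col_means

# Maximum of each column
col_max <- sapply(num_cols, max)
col_max
```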

12.5 Summary

Data Structure   Dimensions   Homogeneous?   Example
Atomic Vector    1D           ✅ Yes          c(1, 2, 3)
List             1D           ❌ No           list(name="Alice", age=25)
Matrix           2D           ✅ Yes          matrix(1:6, nrow=2)
Data Frame       2D           ❌ No           data.frame(name, age)

12.6 Practice exercises

12.6.1 Exercise 1

Recall the earlier example where we computed the years in which the increase in GDP per capita was more than 10%. Let us use matrices to solve the problem. We’ll also compare the time it takes using a matrix with the time it takes using for loops.

G = c(3007, 3067, 3244, 3375,3574, 3828, 4146, 4336, 4696, 5032,5234,5609,6094,6726,7226,7801,8592,9453,10565,11674,12575,13976,14434,15544,17121,18237,19071,20039,21417,22857,23889,24342,25419,26387,27695,28691,29968,31459,32854,34515,36330,37134,37998,39490,41725,44123,46302,48050,48570,47195,48651,50066,51784,53291,55124,56763,57867,59915,62805,65095,63028,69288)

start.time <- Sys.time()

#Let the first column of the matrix be the GDP of all the years except 1960, and the second column be the GDP of all the years except 2021.
GDP_mat <- matrix(c(G[-1], G[-length(G)]), length(G) - 1, 2)

#The percent increase in GDP can be computed by performing computations using the 2 columns of the matrix
inc <- (GDP_mat[,1] - GDP_mat[,2]) / GDP_mat[,2]
years <- 1961:2021
years <- years[inc > 0.1]
years
[1] 1973 1976 1977 1978 1979 1981 1984
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.004027605 secs

Without matrices, the time taken to perform the same computation is measured with the code below.

start.time <- Sys.time()
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
[1] 1973 1976 1977 1978 1979 1981 1984
#print(proc.time()[3]-start_time)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.007945538 secs

Observe that the matrix-based code ran faster here. The computation is vectorized: the arithmetic is applied to entire columns at once, in contrast to a for loop where it is performed one element at a time.
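
The same vectorized idea can be sketched without building a matrix, using the base function diff(), which returns the differences between consecutive elements. The vector g and the years below are hypothetical values chosen only to keep the example short.

```r
# Hypothetical values for four consecutive years (2000 to 2003)
g <- c(100, 105, 120, 126)
yrs <- 2001:2003   # years in which an increase can be computed

# Relative increase with respect to the previous year
growth <- diff(g) / g[-length(g)]
growth

# Years with an increase of more than 10%
high_growth_years <- yrs[growth > 0.1]
high_growth_years
```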

Sometimes, the computations on rows / columns of a matrix are not straightforward and we may need to use the apply() function to apply a function on each row / column of a matrix.

Example: Find the maximum GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and up to 2016-2020.

GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_max_5year <- apply(GDP_5year, 1, max)

In the above code, we applied the in-built function max on all the rows. Sometimes, an in-built function may not be available for the computations to be performed. In such a case, we can write our own user-defined function within the apply() function. See the example below.

Example: Find the range (max-min) of GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and up to 2016-2020.

GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_range_5year <- apply(GDP_5year, 1, function(x) max(x) - min(x))
GDP_range_5year
 [1]  761 1088 2192 3983 4261 4818 4349 6362 6989 2349 6697 7228

In the code above we applied a user-defined function on each row of the matrix. However, if the function has multiple lines, it may be inconvenient to write the function within the apply() function. In that case, we can define the function outside the apply() function.

Example: Find the five year periods starting from 1961-1965, and up to 2016-2020, during which the GDP per capita decreased in some year as compared to the previous year.

GDP_inc <- function (GDP_5yr) {
  dec <- 0
  for (i in 1:4) {
    if(GDP_5yr[i+1] < GDP_5yr[i]) dec <- 1
  }
  return(dec)
}

GDP_5year_mat <- matrix(G[-c(1,length(G))], 12, 5, byrow = TRUE)
years_inc_dec <- apply(GDP_5year_mat, 1, GDP_inc)
five_year_periods <- seq(1960, 2015, 5)
print("Five year periods in which the GDP per capita decreased are those starting from the years:")
[1] "Five year periods in which the GDP per capita decreased are those starting from the years:"
print(five_year_periods[years_inc_dec == 1] + 1)
[1] 2006 2016

The 5 year periods during which the GDP per capita decreased as compared to the previous year are 2006-2010, and 2016-2020.

12.6.2 Exercise 2

Find the 5 year period in which the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.

Solution:

five_year_periods[which.max(apply(GDP_5year_mat, 1, function(x) (max(x) - min(x)) / min(x)))] + 1
[1] 1976
print("During 1976-1980 the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.")
[1] "During 1976-1980 the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest."

12.6.3 Exercise 3

The object country_names is an atomic vector consisting of country names. The object coordinates_capital_cities is a matrix consisting of the latitude-longitude pair of the capital city of the respective country. The order of countries in country_names is the same as the order in which their capital city coordinates (latitude-longitude) appear in the matrix coordinates_capital_cities.

Download the file capital_cities.csv from here. Make sure the file is in your current working directory. Execute the following code to obtain the objects coordinates_capital_cities and country_names.

capital_cities <- read.csv('capital_cities.csv')
coordinates_capital_cities <- as.matrix(capital_cities[,c(3, 4)])
country_names <- capital_cities[,1]

12.6.3.1 Country with capital closest to DC

Print the name and coordinates of the country with the capital city closest to the US capital - Washington DC.

Note that:

  1. The Country Name for US is given as United States in the data.
  2. The ‘closeness’ of capital cities from the US capital is based on the Euclidean distance of their coordinates to those of the US capital.

Hint:

  1. Get the coordinates of Washington DC from coordinates_capital_cities. The row that contains the coordinates of DC will have the same index as United States has in the vector country_names

  2. Create a matrix that has coordinates of Washington DC in each row, and has the same number of rows as the matrix coordinates_capital_cities.

  3. Subtract coordinates_capital_cities from the matrix created in (2). Element-wise subtraction will occur between the matrices.

  4. Use the apply() function on the matrix obtained above to find the Euclidean distance of Washington DC from the rest of the capital cities.

  5. Using the distances obtained above, find the country that has the closest capital to DC.
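
The steps in the hint can be sketched on a tiny made-up example. The names names_ex and coords and all the coordinate values below are hypothetical and are not taken from capital_cities.csv; the reference point "A" plays the role of Washington DC.

```r
# Hypothetical places and their (latitude, longitude) coordinates
names_ex <- c("A", "B", "C")
coords <- matrix(c(0, 0,
                   3, 4,
                   1, 1), ncol = 2, byrow = TRUE)

# Step 1: coordinates of the reference place
ref <- coords[which(names_ex == "A"), ]

# Step 2: a matrix with the reference coordinates repeated in each row
ref_mat <- matrix(ref, nrow(coords), 2, byrow = TRUE)

# Step 3: element-wise subtraction
diffs <- ref_mat - coords

# Step 4: Euclidean distance for each row
dists <- apply(diffs, 1, function(d) sqrt(sum(d^2)))

# Step 5: exclude the reference place itself, then take the minimum
dists[names_ex == "A"] <- Inf
closest <- names_ex[which.min(dists)]
closest
```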

12.6.3.2 Top 10 countries closest to DC

  1. Print the names of the countries of the top 10 capital cities closest to the US capital - Washington DC.

  2. Create and print a matrix containing the coordinates of the top 10 capital cities closest to Washington DC.

US_index <- which(country_names == 'United States')
dc_coord <- coordinates_capital_cities[US_index,]
distances_to_DC <- apply(coordinates_capital_cities, 1, 
                  function(city_coord) sqrt(sum((city_coord - dc_coord)^2)))
num_of_countries <- length(country_names)
distances_to_DC_matrix <- cbind(1:num_of_countries, distances_to_DC)
sorted <- distances_to_DC_matrix[order(distances_to_DC_matrix[,2]),]

Top 10 countries with capitals closest to Washington DC are the following:

country_names[sorted[3:12, 1]]

The coordinates of the top 10 capital cities closest to Washington DC are:

coordinates_capital_cities[sorted[3:12, 1],]

12.6.4 Exercise 4

Download the dataset movies.json. Execute the following code to read the data into the object movies:

library(rjson)
movies<-fromJSON(file = 'movies.json')

12.6.4.1

What is the datatype of the object movies?

class(movies)

The datatype of the object movies is list.

12.6.4.2

Count the number of movies having a negative profit, i.e., their production budget is higher than their worldwide gross.

Ignore the movies that:

  1. Have missing values of production budget or worldwide gross. Use the is.null() function to identify missing or NULL values.

  2. Have a zero worldwide gross (A zero worldwide gross is probably an incorrect value).

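count <- 0
for (i in seq_along(movies)) {
  pb <- movies[[i]]$`Production Budget`
  wg <- movies[[i]]$`Worldwide Gross`
  # Skip movies with a missing production budget or worldwide gross
  if (!(is.null(pb) || is.null(wg))) {
    # Count movies with a non-zero gross that is lower than the budget
    if (pb > wg && wg > 0) {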
      count <- count + 1
    }
  }
}
print(paste("Number of movies with negative profit =", count))

Solve the above problem without using a for loop. Use the lapply() function.

# TRUE for a movie if both values are present, the gross is positive, and the budget exceeds it
is_loss <- lapply(movies, function(x) {
  pb <- x$`Production Budget`; wg <- x$`Worldwide Gross`
  !is.null(pb) && !is.null(wg) && wg > 0 && pb > wg
})
sum(unlist(is_loss))